### Lab 7 - Samples and box plots

A *population* is everyone or everything that we want to study.  For example, if we want to study the blood pressure of New Yorkers, the population would be everyone living in New York.

A *sample* is the subset of the population for which we have observations/measurements/data.  For example, for the blood pressure study, it's impractical to measure the blood pressure of everyone living in New York, so we might just randomly select 100 New Yorkers and measure their blood pressure.  Those 100 people are the sample.

In this lab, we will investigate how the distribution of a sample compares to the distribution of its population.  Recall that the distribution of a sample or population is the precise description of the possible data values along with their frequencies (how often each value occurs).  In previous labs, we learned how to visualize distributions using bar charts and histograms.  Today we will learn a third way, called a *box plot*.

We will use the Green Taxi Trip dataset from Labs 3, 4, and 6 and the MoMA artwork dataset from Lab 5. 

As usual, we will import the matplotlib and pandas packages, and set plots to appear in the Jupyter notebook.

In [1]:
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline

### Samples of a quantitative variable (green taxi trip distances)

We will use the green taxi dataset first, so read that CSV file into a dataframe named `taxi`.

Check that the dataframe was created properly by displaying it.

Let's assume that this dataset is our *population*, meaning we are interested only in the taxi trips contained in this dataset and no other trips.  The *distribution* of the trip distance for this population can be visualized by the histogram of the trip distance.  Write code below to generate a histogram with 40 bins of the trip distances in `taxi`.

<details> <summary>Pattern:</summary>
    <code>dataframe_name["column_name"].hist()</code>
</details>
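As a reminder of how the number of bins is set, here is a minimal sketch on made-up data.  The dataframe `demo` and its values are hypothetical stand-ins (not the real taxi data); only the `bins` argument of `.hist()` is the point.  It assumes the code is run in the notebook, so the plot appears inline.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: 1000 made-up trip distances (NOT the real taxi data).
rng = np.random.default_rng(0)
demo = pd.DataFrame({"trip_distance": rng.exponential(scale=3.0, size=1000)})

# bins=40 splits the range of the data into 40 equal-width intervals.
ax = demo["trip_distance"].hist(bins=40)
ax.set_xlabel("trip distance")
ax.set_ylabel("number of trips")
```

Fewer bins give a coarser picture of the distribution; more bins show finer detail but can look noisy.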

Now we will take a random sample of 10 trips and plot the histogram of this sample.  Type the code `sample10 = taxi.sample(10)` below and run it.

This code randomly selects 10 rows from the dataframe `taxi` and copies them into a new dataframe called `sample10`.  To see this, display the dataframe `sample10` below.
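To get a feel for what `.sample()` does, the sketch below uses a tiny made-up dataframe (`toy` is a hypothetical name, and its values are invented).  It also shows the optional `random_state` argument, which is not used elsewhere in this lab: fixing it makes the same rows come out every time.

```python
import pandas as pd

# Hypothetical toy dataframe standing in for `taxi`.
toy = pd.DataFrame({"trip_distance": [0.5, 1.2, 2.0, 3.4, 4.1, 5.8, 7.3, 9.9]})

# Draw 3 rows at random, without replacement (the default).
s1 = toy.sample(3)
print(s1)  # the row labels on the left show which original rows were picked

# Fixing random_state makes the "random" choice repeatable.
s2 = toy.sample(3, random_state=42)
s3 = toy.sample(3, random_state=42)
print(s2.equals(s3))  # prints True
```

Without `random_state`, each run of `toy.sample(3)` may pick different rows.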

Make a histogram of the trip distances in the `sample10` dataframe below:

<details> <summary>Answer:</summary>
    <code>sample10["trip_distance"].hist()</code>
</details>

How does this histogram of the sample trip distances compare to the histogram of the population distances? 

Let's try a bigger sample of size 50.  Write code below to take a sample of 50 rows from the dataframe `taxi` and plot the histogram of these sample trip distances.

<details> <summary>Answer:</summary>
    <code>sample50 = taxi.sample(50)
sample50["trip_distance"].hist()</code>
</details>

How does this histogram of the sample trip distances compare to the histogram of the population trip distances?

What happens if you re-run this code?  Why?

Let's try an even larger sample size of 200.  Write code below to take a sample of size 200 from the `taxi` dataframe and plot the histogram of these sample trip distances.  Try different numbers of bins, and use the one that seems to give the most informative histogram.

How does this histogram compare to the histogram of the population?

Take a final sample of 1000 trips.  Can you predict how the trip distance histogram will differ from the histogram of the sample of 200?

Plot the histogram of these 1000 trip distances below.  Use the number of bins that gives the most informative histogram.

How does this histogram compare to the histogram of the population?  As the sample size increased, how did the histograms change?
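One possible way to compare several sample sizes at a glance is to draw the histograms side by side.  The sketch below does this on a made-up population (the dataframe `population` and its values are hypothetical, not the taxi data), and the fixed `random_state` values are only there to make the figure repeatable.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical population of 20000 made-up trip distances.
rng = np.random.default_rng(1)
population = pd.DataFrame({"trip_distance": rng.exponential(scale=3.0, size=20000)})

sizes = [10, 50, 200, 1000]
# One panel per sample size, plus a final panel for the whole population.
fig, axes = plt.subplots(1, len(sizes) + 1, figsize=(16, 3), sharex=True)

for ax, n in zip(axes, sizes):
    sample = population.sample(n, random_state=n)
    sample["trip_distance"].hist(bins=20, ax=ax)
    ax.set_title(f"sample of {n}")

population["trip_distance"].hist(bins=20, ax=axes[-1])
axes[-1].set_title("population")
```

The same layout should work with `taxi` in place of `population`, assuming the column is named `trip_distance` as in the answers above.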

### Samples of a qualitative variable (gender of MoMA artists)

Now we will look at samples from the MoMA artworks dataset from Lab 5.  Load the CSV file into the dataframe `art`, and check that it was read in correctly.

We are going to look at the distribution of the artists' gender.  First, write code to count how often each value occurs in the gender column and display the counts.  (Don't make a bar chart yet!)

<details> <summary>Pattern:</summary>
    <code>dataframe_name["column_name"].value_counts()</code>
</details>

What do you notice about the values?  

Some of the artworks have multiple artists, so there are many values beyond male and female.  The code below creates a new dataframe called `solo_art` that contains only works by a single artist.  We will learn how to write these filters in the next two classes, but for now, just run the code.

In [None]:
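# Each filter is True for the rows whose Gender value matches exactly.
male_filter = art["Gender"] == "(Male)"
female_filter = art["Gender"] == "(Female)"
neither_filter = art["Gender"] == "()"  # one artist entry, no gender listed
# Keep the rows where at least one of the three filters is True.
solo_art = art[male_filter | female_filter | neither_filter]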
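For a preview of how such filters work, here is a small sketch on a made-up dataframe.  The name `toy_art`, the titles, and the combined value `"(Male) (Female)"` are invented for illustration and may not match how the real dataset records multiple artists.

```python
import pandas as pd

# Hypothetical toy data: four made-up artworks.
toy_art = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Gender": ["(Male)", "(Female)", "(Male) (Female)", "()"],
})

# Comparing a column to a value gives one True/False per row.
is_male = toy_art["Gender"] == "(Male)"
print(is_male.tolist())  # [True, False, False, False]

is_female = toy_art["Gender"] == "(Female)"

# `|` means "or": keep rows where either filter is True.
solo = toy_art[is_male | is_female]
print(solo)
```

Only rows A and B survive here, because rows C and D match neither filter.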

Write code below to count how many artworks in the new dataframe `solo_art` fall under each gender value.

<details> <summary>Answer:</summary>
    <code>solo_art["Gender"].value_counts()</code>
</details>

Next, make a bar chart of the gender counts (you will have to save the counts as a variable).

<details> <summary>Pattern:</summary>
    <code>counts_variable = dataframe_name["column_name"].value_counts()
counts_variable.plot(kind = "bar")</code>
</details>

What do you notice about this gender distribution?  We treat all MoMA artworks by solo artists as our population, so this bar chart visualizes the population distribution.

Now let's take a sample of 10 rows from the `solo_art` dataframe.  Can you figure out how to do this?

<details> <summary>Answer:</summary>
    <code>art_sample10 = solo_art.sample(10)</code>
</details>

Write code to create a bar chart of the gender of the artists in this sample:

<details> <summary>Answer:</summary>
    <code>counts10 = art_sample10["Gender"].value_counts()
counts10.plot(kind = "bar")</code>
</details>

How does the distribution of this sample compare to the population distribution?

Let's take a larger sample of size 50 from the `solo_art` dataframe and plot a bar chart of the genders in this sample.

<details> <summary>Answer:</summary>
    <code>art_sample50 = solo_art.sample(50)
counts50 = art_sample50["Gender"].value_counts()
counts50.plot(kind = "bar")</code>
</details>

How does the distribution of this sample compare to the population distribution and the previous sample?

What happens if you re-run this code?  Why?

Now let's try a sample of size 200.  Take the sample from the `solo_art` dataframe and make a bar chart of the genders in this sample.

How does the distribution of this sample compare to the population distribution and the previous samples?

Finally, let's try a sample of size 1000.  Take the sample from the `solo_art` dataframe and make a bar chart of the genders in this sample.

How does the distribution of this sample compare to the population distribution?  How did the bar charts change as the sample size increased?  Is this the same behavior you saw with the histogram of trip distances and the previous sample distributions?
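Because samples of different sizes have very different counts, it can be easier to compare proportions.  The sketch below is one way to do that using `value_counts(normalize=True)`; the dataframe `toy` and its 80/15/5 split are made up for illustration and are not the MoMA numbers.

```python
import pandas as pd

# Hypothetical toy population (NOT MoMA data): 80 "(Male)", 15 "(Female)", 5 "()".
toy = pd.DataFrame({"Gender": ["(Male)"] * 80 + ["(Female)"] * 15 + ["()"] * 5})

# normalize=True returns fractions that add up to 1 instead of raw counts.
pop_props = toy["Gender"].value_counts(normalize=True)
sample_props = toy.sample(20, random_state=0)["Gender"].value_counts(normalize=True)

# Put both in one table; a value missing from the sample becomes 0.
comparison = pd.DataFrame({"population": pop_props, "sample of 20": sample_props}).fillna(0)
print(comparison)

# Grouped bar chart: one pair of bars per gender value.
comparison.plot(kind="bar")
```

The closer the two bars in each pair, the better the sample reflects the population.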

### Box plots

Finally, let's look at another way to visualize the distribution of quantitative data: the *box plot*.  Type the code `taxi["trip_distance"].plot(kind = "box")` below and run it.

A box plot shows 5 aspects of a distribution:
   - the minimum value
   - the 25th percentile (25\% of the data is less than this value)
   - the median (50\% of the data is less than this value)
   - the 75th percentile (75\% of the data is less than this value; equivalently, 25\% of the data is greater than this value)
   - the maximum value
   
Can you figure out what parts of the box plot correspond to these 5 numbers?
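To check your reading of the plot, the five numbers can also be computed directly.  This is a minimal sketch on a made-up series (`values` is a hypothetical name).  One caveat: with the default settings, matplotlib draws each whisker only as far as 1.5 times the height of the box, and shows any more extreme values as separate points, so the minimum or maximum may appear as a point rather than at the end of a whisker.

```python
import pandas as pd

# Hypothetical data: nine made-up values.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

# Quantiles 0, 0.25, 0.5, 0.75, and 1 are the minimum, 25th percentile,
# median, 75th percentile, and maximum.
summary = values.quantile([0, 0.25, 0.5, 0.75, 1])
print(summary)  # 1.0, 3.0, 5.0, 7.0, 9.0
```

The same call on `taxi["trip_distance"]` should give numbers that can be matched against the box plot above.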

#### Challenges:
- What happens if you take a sample even larger than 1000 (for either or both datasets)?
- Rerun the code that takes a random sample and plots the distribution as a histogram or bar chart, once for sample size 50 and once for sample size 1000.  How much do the plots change from run to run?  Why?
- Make a box plot of another column with quantitative data.